比利时专利BE1023661A9 A DATA PROCESSING SYSTEM FOR PROCESSING INTERACTIONS

专利PDF首页>>比利时专利

专利附录

专利说明

权利要求

类似技术

同族专利

引用文献

法律状态

优先权

专利摘要:
Een dataverwerkingssysteem voor verwerking van data met betrekking tot inkomende interacties. Het dataverwerkingssysteem omvattend een eerste configuratieobject definiërend een verzameling van metrieken; een tweede configuratieobject definiërend een aantal aggregatieprimitieven, door analyse van eisen aan de verzameling van metrieken; een eerste verwerkingseenheid die is ingericht voor ontvangst van data met betrekking tot inkomende interacties, en is ingericht voor berekening in real-time van ten minste één aggregatieprimitieve op basis van deze data met inachtneming van het tweede configuratieobject; een dataopslagelement voor opslag van de berekende ten minste ene aggregatieprimitieve zonder opslag van ruwe data met betrekking tot de inkomende interacties; een tweede verwerkingseenheid die is ingericht voor berekening van één of meer metrieken op basis van de opgeslagen ten minste ene aggregatieprimitieve met inachtneming van het eerste en het tweede configuratieobject. A data processing system for processing data related to incoming interactions. The data processing system comprising a first configuration object defining a set of metrics; a second configuration object defining a number of aggregation primitives, by analyzing requirements for the set of metrics; a first processing unit adapted to receive data relating to incoming interactions, and adapted to calculate in real time at least one aggregation primitive based on this data taking into account the second configuration object; a data storage element for storing the calculated at least one aggregation primitive without storing raw data related to the incoming interactions; a second processing unit that is adapted to calculate one or more metrics based on the stored at least one aggregation primitive taking into account the first and the second configuration object.
公开号:BE1023661A9
申请号:E20165319
申请日:2016-05-03
公开日:2017-07-06
发明作者:Steven NOELS；Luc BURGELMAN；Frank HAMERLINCK
申请人:NGDATA Products NV；
IPC主号:

专利说明:

A DATA PROCESSING SYSTEM FOR PROCESSING INTERACTIONS Field of the invention
The present invention relates to the field of data processing systems. The invention more particularly relates to the field of analysis of large amounts of data.
BACKGROUND OF THE INVENTION
When large amounts of data are present and are constantly being entered into a data storage medium (which may even have been distributed), accurate and rapid analysis of the data in the data storage medium is crucial for users who want to act on information contained therein.
It is important here that the response time to a user request is within a reasonable time limit (e.g. a few seconds). The response must also be based on an accurate representation of the data present in the data storage. The data storage medium can contain a wide variety of data, can be distributed and the stored data can be unstructured. User requests may therefore require a thorough analysis of the data storage medium that has a direct impact on the response time.
Data processing systems are therefore faced with the problem of, on the one hand, requirements for real-time response time, and, on the other hand, requirements for thorough analysis on distributed data storage media that contain an enormous amount of data that is constantly increasing. US 2013/0 013 552 Al discloses an interest-driven business intelligence system. In this data processing system, raw data is stored in a storage medium for raw data, metadata in a storage medium for metadata, and the system comprises an interest-driven data pipeline that is automatically compiled for generation of reporting data using the raw data. The interest-driven data pipeline is compiled based on reporting data requirements that are automatically derived from at least one report specification defined using the metadata. US 2013/0 013 552 A1 therefore wants to have an improved reaction time, and thus improved interactivity, in comparison with existing systems used for storing large amounts of data. Instead of using data pipelines that have to be changed by engineers whenever data requirements change, an interest-driven data pipeline is automatically created when reporting data requirements are changed.
However, there is still room for more efficient data processing systems for processing for analyzing large amounts of data (for example in terms of response time, accuracy and / or flexibility).
Summary of the invention
It is an object of embodiments of the present invention to provide an efficient data processing system.
The above object is achieved by a method and device according to the present invention.
In a first aspect, embodiments of the present invention relate to a data processing system for processing data relating to incoming interactions. The data processing system comprises: a first configuration object defining a set of metrics; a second configuration object defining a number of aggregation primitives, by analyzing requirements for the set of metrics; a first processing unit adapted to receive data relating to incoming interactions, and adapted to calculate in real time at least one aggregation primitive based on this data taking into account the second configuration object; a data storage element for storing the calculated at least one aggregation primitive without storing raw data related to the incoming interactions; a second processing unit adapted to calculate one or more metrics based on the stored at least one aggregation primitive taking into account the first configuration object and the second configuration object.
It is an advantage of embodiments of the present invention that aggregation primitives are defined based on the first configuration object that defines a set of metrics. One metric can be obtained by combining different aggregation primitives. One aggregation primitive can be used to obtain different metrics. It is an advantage of embodiments of the present invention that an intermediate step (e.g., the aggregation primitives being stored) is used to obtain the metrics. Since the stored aggregation primitives can be reused for obtaining different metrics, it is possible to avoid having to perform similar calculations based on the interactions twice. It is an advantage of embodiments of the present invention that aggregation primitives can be calculated and stored in real time. With a request for a metric (i.e. when consulting the data) it is therefore not necessary to process all interaction-related data. The requested metric (s) can be calculated based on the already calculated aggregation primitives. If the calculation and storage of the aggregation primitives takes place in real time, they are ready for calculation of a certain metric on request. The calculation of the metrics is based on simple operators such as sum, count, or MAX, MIN operations. This allows rapid real-time calculation of the metrics using the aggregation primitives. It is an advantage of embodiments of the present invention that the metrics are not calculated in real time upon arrival of the interactions. This relieves the data processing system, leaving more computing power for calculation and storage of the aggregation primitives. The effective metrics are only calculated on request. It is an advantage of embodiments of the present invention that the aggregation primitives are stored and in the data storage element. This makes it possible to request metrics based on aggregation primitives from the past without having to recalculate them. It is an advantage that only the aggregation primitives must be stored and not all extractable data from the interactions. This makes it possible to reduce the required space in the data storage element. It is an advantage of embodiments of the present invention that the data processing system does not have to store the raw data for calculating the metrics. This makes it possible to reduce the space required in the data storage element of the data processing system. It is an advantage of embodiments of the present invention that the aggregation primitives are stored since different metrics can be calculated based on these aggregation primitives. In addition, new metrics can be defined and calculated based on the already existing set of aggregation primitives. In embodiments of the present invention, when the first processing unit and the second processing unit are arranged to perform a task such as receiving data relating to an incoming interaction, this can be implemented by composing a piece of code and then executing it on the first or second processing unit. It is an advantage of embodiments of the present invention that it is not necessary to rebuild the code when a new metric is requested (i.e., a new query). It is an advantage of embodiments of the present invention that the first configuration object makes it possible to define a new set of metrics and that the aggregation primitives can be obtained therefrom without having to reconstruct the code.
The first processing unit can be arranged for retrieving data from an external storage medium. The first processing unit can be arranged for calculating at least one aggregation primitive on the basis of the data requested from the external storage medium.
It is an advantage of embodiments of the present invention that calculation of the aggregation primitives can be done on the basis of data relating to incoming interactions (real-time) as well as on the basis of data stored in an external storage medium. It is an advantage of embodiments of the present invention that these calculations are performed in parallel.
The first configuration object and the second configuration object can be one and the same object.
It is an advantage of embodiments of the present invention that the metrics and aggregation primitives are defined in one and the same object. This makes it easier for a user to add or update a relationship. It is an advantage that only one object has to be changed when a new metric is defined.
The first processing unit and the processing unit can be one and the same processing unit.
The first processing unit can be arranged for subdividing interactions into interaction types.
It is an advantage of embodiments of the present invention that the first processing unit is arranged to distinguish these different interaction types since this makes it possible to calculate different aggregation primitives per interaction type.
The first processing unit can be arranged for grouping aggregation primitives on the basis of the extracted data.
In embodiments of the present invention, the data extracted from the interactions can be subdivided into data from which the aggregation primitives can be calculated and into data representation entities. It is an advantage of embodiments of the present invention that these entities can be used to group the aggregation primitives.
The first processing unit and / or the second processing unit can be arranged for monitoring at least one metric.
It is an advantage of embodiments of the present invention that aspects of the interactions can be monitored in real time. It is an advantage of embodiments of the present invention that monitoring of the metrics can be used for generation of alarms. It is thereby an advantage that the metrics can be calculated on request and that it is not necessary to save them. In embodiments of the present invention, a threshold value is specified for a metric, so that when the metric rises or falls below this threshold, an alarm is generated. This can, for example, indicate that the number of clicks on a website has risen above a predetermined level. The threshold corresponding to a metric can be stored in the data storage element. It is an advantage that the metrics can be regularly updated on request as this also allows ad hoc monitoring of certain aspects of the dynamic behavior of the metrics.
The first processing unit and / or the second processing unit can be configured to generate an alarm based on the monitored at least one metric.
It is an advantage of embodiments of the present invention that changes to the metrics can be reported in real time. The configuration of the first processing unit and / or the second processing unit for generating the alarm is also referred to as the alarm system.
In a second aspect, embodiments of the present invention relate to a method for processing data with respect to incoming interactions. The method comprises: in a first step, definition of a set of metrics in a first configuration object; in a second step, definition of a number of aggregation primitives in a second configuration object by analyzing requirements for the set of metrics; in a third step execution of a process wherein the data relating to incoming interactions is received and wherein at least one aggregation primitive is calculated in real time based on this data taking into account the second configuration object, and wherein the at least one aggregation primitive is stored in a data storage element without storage of raw data related to the incoming interactions; and in a fourth step, execution of a process wherein requests for the at least one metric are received and wherein the at least one metric is calculated based on the stored at least one aggregation primitive taking into account the first configuration object and the second configuration object.
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Features of the dependent claims can be combined with features of the independent claims and with features of other dependent claims and not only as expressly set out in the claims.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiment (s) described below.
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows a block diagram of a data processing system according to an embodiment of the present invention. FIG. 2 is a flow chart illustrating the process steps in a method according to an embodiment of the present invention. FIG. 3 is a flow chart illustrating the message queues in an alarm system according to an embodiment of the present invention. FIG. 4 shows a data stream according to an embodiment of the present invention. FIG. 5 shows an example of a relationship between an interaction and metrics according to an embodiment of the present invention.
The drawings are only schematic and are not limitative. In the drawings, the dimensions of some elements may be exaggerated and not drawn to scale for illustrative purposes.
Any reference characters in the claims should not be considered as limiting the scope.
On the different figures, the same reference numerals refer to the same or analogous elements.
Detailed description of exemplary embodiments
The present invention will be described with respect to particular embodiments and with reference to certain drawings, but is not limited thereby, but only by the claims. The described drawings are only schematic and are not limitative.
The terms first, second and the like in the description and in the claims are used to distinguish between similar elements and not necessarily for describing a sequence, whether temporary, spatial, in the arrangement or in any other way. It will be understood that the terms so used are interchangeable under suitable circumstances and that the embodiments of the invention described in this patent may be executed in a different order than described or illustrated in this patent.
It is to be noted that the term "comprising", used in the claims, should not be interpreted as being limited to the means listed thereafter; this term does not exclude other elements or steps. This should therefore be interpreted as an indication of the presence of the specified elements, integers, steps or components referred to, but does not exclude the presence or addition of one or more other elements, integers, steps or components, or groups thereof. The scope of the expression "a device comprising means A and B" should therefore not be limited to devices that consist exclusively of components A and B. This means that with regard to the present invention, A and B are the only relevant components of the device.
Reference in this specification to "one embodiment" or "an embodiment" means that a particular element, structure or property described in connection with the embodiment is included in at least one embodiment of the present invention. The presence of the expressions "in one embodiment" or "in an embodiment" at various places in this specification therefore does not necessarily refer, but possibly to the same embodiment. Furthermore, the relevant elements, structures or features may be combined in any suitable manner in one or more embodiments, as will be apparent to a person skilled in the art from this disclosure.
Similarly, it should be borne in mind that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped into one embodiment, figure or description thereof for the purpose of streamlining disclosure and facilitating the understanding of one or more more of the various aspects of the invention. However, this method of disclosure should not be construed to reflect an intention that the claimed invention requires more elements than those explicitly mentioned in each claim. On the contrary, the following claims reflect that aspects of the invention are present in less than all elements of a single of the embodiments described above. Thus, the claims after the detailed description are hereby expressly incorporated in this detailed description, wherein each claim stands on its own as a separate embodiment of the present invention.
Furthermore, although some embodiments described in this patent include certain but not other elements included in other embodiments, combinations of elements of different embodiments are within the scope of the invention, and form different embodiments, as will be understood by those skilled in the art in technology. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Numerous details have been set forth in the description of this patent. It is to be understood, however, that embodiments of the invention may be practiced without these details. In other cases, well-known methods, structures and techniques have not been shown in detail so as not to complicate the understanding of this description.
DEFINITIONS
In the context of the present invention, the following terms have the following meaning:
When, in embodiments of the present invention, reference is made to "raw data," reference is made to the data relating to incoming interactions.
When "metrics" is referred to in embodiments of the present invention, user-defined parameters are referred to. The user thus defines the metrics based on the raw data. A metric can be referred to with an identification, the so-called 'metric ID'.
When in embodiments of the present invention reference is made to "aggregation primitives," reference is made to values that are calculated and stored upon arrival of incoming interactions. The metrics can be calculated using the calculated and stored aggregation primitives.
When in embodiments of the present invention, reference is made to "an alarm message queue" is referred to a queue in which changes to a Boolean predicate are placed in the queue. The Boolean predicate is therefore evaluated with regard to one or more metrics.
When in embodiments of the present invention reference is made to "a metric change queue", reference is made to a queue in which metrics possibly affected by modification of an aggregation primitive are displayed.
When in embodiments of the present invention reference is made to "a write interceptor", reference is made to a function for intercepting side effects of the aggregation primitives managed in the data storage medium 130 per entity. The entities are thereby extracted data from the incoming interactions and are used for grouping the aggregation primitives.
When in embodiments of the present invention reference is made to "an interaction recording end point", reference is made to a portion of the first processing unit 120 where data is received. The interaction recording endpoint may be implemented in an application programming interface (API).
In a first aspect, the present invention relates to a data processing system 100 for processing data relating to incoming interactions (e.g., a click stream). An incoming interaction can include various attributes such as, but not limited to, a value associated with the interaction, the name and data type of this value, the time at which the interaction occurred. An illustrative block diagram of such a processing system 100 is shown in FIG. 1.
The processing system 100 comprises a first processing unit configured to receive data relating to incoming interactions, to calculate on the basis thereof at least one aggregation primitive and to output this at least one aggregation primitive. The processing system 100 includes a data storage element 130 for storing the at least one aggregation primitive. This storage of aggregation primitives is grouped per entity, for example per consumer. The processing system 100 also includes a second processing system configured to receive at least one aggregation primitive, and to calculate on the basis thereof at least one metric defined to allow extraction of user relevant data from the data related to the user. the incoming interactions.
In embodiments of the present invention, the data processing system 100 includes a first configuration object 110 that defines a set of metrics. These metrics can be defined based on the data related to the incoming interactions.
In embodiments of the present invention, the interactions can be subdivided into interaction types. This allows the first processing unit 120 to calculate aggregation primitives per interaction type and store the aggregation primitives in the data storage element 130 per consumer and per interaction type.
In embodiments of the present invention, the data relating to incoming interactions can be structured before being used to calculate the aggregation primitives.
The data processing system further comprises a second configuration object 140 that defines a number of aggregation primitives. In the second configuration object, the aggregation primitives are defined and how they should be calculated based on the data related to incoming interactions. This can be done by the first processing unit 120. The aggregation primitives are defined by analyzing requirements for the set of metrics as defined in the first configuration object 110. The first configuration object 110 is analyzed and from which it is concluded which aggregation primitives will or should be stored. The metrics can be calculated later using the aggregation primitives. The first configuration object 110, which defines the metrics, can be defined at start-up and then changed, for example when a different information type is required. After changing the first configuration object 110, the second configuration object 140 is redefined based on the first configuration object 110, so that the aggregation primitives required for calculating the newly defined set of metrics are available.
The first and second configuration object may, for example, be in the form of a file or an internal software structure.
In embodiments of the present invention, the data relating to the incoming interactions, also referred to as the raw data, is extracted from the interactions by a first processing unit 120. The first processing unit 120 additionally calculates at least one aggregation primitive based on the extracted data with consideration of information from the second configuration object 140. This takes place in real time upon arrival of incoming interactions.
In embodiments of the present invention, the calculated aggregation primitives are stored in the data storage element 130. They can be stored at any time. This allows current aggregation primitives and aggregation primitives from the past to be retrieved. However, the raw data is not stored in the data storage element 130. This is not required according to embodiments of the present invention because the at least one metric can be obtained from the aggregation primitives calculated in real time and stored in the data storage element 130.
The stored aggregation primitives serve as the basis for calculating the metrics. In embodiments of the present invention, this can be done by the second processing unit 150 which is adapted to calculate metrics based on the stores at least one aggregation primitive taking into account the first configuration object 110 and the second configuration object 140. A particular aggregation primitive can be reused for calculating different metrics.
In embodiments of the present invention, the aggregation primitives can be stored per dimension. A primary dimension can be the time dimension. In that case, the aggregation primitives are stored per predefined time unit, such as, for example, per day. This makes it possible to calculate metrics taking into account different time windows. For example, the metrics can be calculated on a weekly basis, or on a monthly basis, or on an annual basis, all using the same aggregation primitives. Of course, the predefined time unit must be chosen such that metrics can be determined for the smallest time window in which one is interested. For example, if aggregation primitives are stored and aggregated per day, it will not be possible to extract information on an hourly basis, while conversely, if aggregation primitives are stored and aggregated per hour, it will be possible to extract information on both an hourly and daily basis. .
In embodiments of the present invention, additional dimensions can be used to store the aggregation primitives, such as, for example, merchant category. The aggregation primitives can, for example, be structured using split metrics (e.g. per city and per merchant category). An example of this is shown in FIG. 5. In the example, the incoming interaction 510 includes: - an interaction type 511 (e.g., purchase), - a time stamp 512 (e.g., 2010-05-07T13: 01: 51 + 01: 00) corresponding to a time dimension, - a trader category 513 (eg gifts) corresponding to a category dimension, - a city 514 (eg Ghent) corresponding to an additional category dimension, - an amount 515 (eg 230 euros).
These incoming interactions can lead to the following resulting metrics 520: - A Purchase Amount Per Day metric 521 and an Average Purchase Amount Per Week metric 522; these metrics have a time dimension, - A Total Number of Purchases Per City metric 523, a split metric with city 514 as category dimension, - An Average Purchase Amount, Per Dealer category Per Month 524, a split metric with trader category 513 as a category dimension and with a time dimension.
In embodiments of the present invention, the metrics can be calculated on request. In the same or alternative embodiments, certain metrics can essentially also be calculated immediately upon arrival of a new interaction.
Operators for calculating the metrics from the aggregation primitives are operations that allow a quick calculation from the storage of the aggregation primitives. These operations are preferably distributive operations (i.e., both associative and commutative) such as sum, min., Max., First, last, countDistinct and lag.
The operator 'countDistinct' counts the number of unique values that have been seen under all interactions in the aggregation window. This is based on the probabilistic HyperLogLog, so that large cardinalities can be processed. The count is not exact, but has a good chance of being close to the actual value.
The 'lag' operator makes it possible to obtain the nth value of a reference point. The reference point is configurable so that the user can retrieve the nth previous value since a certain date or a metric with a date value.
The associative and commutative properties make it possible to calculate and store the aggregation primitives during execution (with all incoming interactions) and then calculate the potentially complex metrics at the time of the query based on the aggregation primitives. The aggregation primitives already calculated lead to a reduction in the calculation time of the metrics. The second configuration object 140 may consist of the name and data type of the value (s) to be used for calculating the aggregation primitive as well as the distributive operator (s). A subset of the aggregation primitives can be used to calculate the metrics. An average is an example of an operation that cannot be implemented directly as a distributive operation, although it can be split into two distributive operations (sum and count). The sum and count can be calculated in real time and stored as aggregation primitives. The average is a metric that can be calculated on request based on the sum and the count.
The metrics are defined in the first configuration object 110 and they can be calculated using the aggregation primitives defined in the second configuration object 140. The second processing unit 150 is configured to calculate the metrics based on the at least one aggregation primitive taking into account the first configuration object 110 and the second configuration object 140. The calculation of a metric can be triggered by a request for that metric.
The first configuration object 110 and the second configuration object 140 can be different configuration objects, or they can be implemented in one and the same configuration object.
The first processing unit 120 and the second processing unit 150 can be physically distinguishable processing units, each of which is arranged for, respectively, receiving data relating to incoming interactions, and which are arranged for real-time calculation of at least one aggregation primitive based on these data taking into account the second configuration object 140, and being arranged to receive at least one aggregation primitive and calculating metrics based on that at least one aggregation primitive taking into account the first configuration object 110 and the second configuration object (140). Alternatively, they can be implemented as one and the same processing unit that is arranged to receive data regarding incoming interactions, and is arranged for real-time calculation of at least one aggregation primitive based on this data taking into account the second configuration object 140 and is arranged to calculate metrics based on the stored at least one aggregation primitive taking into account the first configuration object 110 and the second configuration object (140).
In embodiments of the present invention, the functionality of the first processing unit 120 can be distributed over various processing units. This can also be the case for the second processing unit 150.
The first processing unit 120 can also be arranged for retrieving data from an external memory 160, instead of, or in addition to, immediately obtaining raw interaction data. The data stored in the external memory 160 can be structured in the same way as the data originating from the incoming interactions. If the data from the external memory 160 are structured in the same way as the data from the interactions, it is an advantage that calculation of the at least one aggregation primitive based on data from an interaction and calculation of the at least one aggregation primitive based on data from an external memory 160 can be performed in the same way.
In embodiments of the present invention, data relating to incoming interactions can be supplemented with data from other systems. This data can also be used in the calculation of the aggregation primitives. The data related to incoming interactions can be stored in a specific structure.
In embodiments of the present invention, the aggregation primitives are grouped by entity. An example of an entity to which the present invention is not limited is a consumer. As a result, a set of aggregation primitives is assigned for each consumer. For each interaction it is known which consumer it is. Each row in the database can match a different consumer. A number of aggregation primitives can be stored in each row.
In embodiments of the present invention, metrics can be calculated based on the aggregation primitives calculated in real time. It is therefore not necessary to retrieve the interaction data for obtaining a metric. It is an advantage of embodiments of the present invention that the response time to a metric request is reduced by storing, in accordance with embodiments of the present invention, the aggregation primitives.
Typically, processing systems 100 according to embodiments of the present invention should be able to handle large amounts of data, for example, more than 10 TB or even more than 1000 TB. The number of interactions per second with which such processing systems should be able to handle 100 is above 1000 or even above 10,000. The response time to a request for a metric is preferably less than 20 ms or even less than 5 ms or even less than 0.1 ms.
Data processing systems according to embodiments of the present invention can be organized in a distributed configuration that is executed on a number of nodes. This allows the system to be scaled based on the processing capacity by increasing the number of nodes. FIG. 4 shows a possible data stream in a data processing system according to an embodiment of the present invention. Block 401 represents the incoming raw data. The raw data can be mapped by a mapping process 402. To that end, the mapping process uses interaction type configuration information stored in a configuration object 410. The raw data related to incoming interactions is converted into structured data according to a schedule that is stored in the configuration object 410. The mapped data 403 is used as input for a supplemental process 404. The supplemental process can supplement the mapped data 403 with data from other systems, resulting in the supplemented data 405. A process 406 that is executed in the first processing unit 120, is responsible for processing the supplemented data 405 and calculating aggregation primitives with regard to the second configuration object 140, and transferring these aggregation primitives for storage in data storage element 130. The stored aggregation primitives are used as input for a quer y process 407 that is executed in the second processing unit 150. The query process can be configured with query parameters 408, which can be defined by the user, for example. The query process takes into account the first configuration object 110 and the second configuration object 140 for obtaining the metrics 409. The first configuration object 110 defines a collection of metrics based on the incoming interactions. This is represented by the dotted line between the first configuration object 110 and configuration type interaction in the configuration object 410. In embodiments of the present invention, the first processing unit 120 and / or the second processing unit 150 are adapted to monitor at least one aggregation-primitive and / or at least one metric. The metrics may only be used to analyze the dynamic behavior with regard to the incoming interactions. For example, it is possible to check whether a maximum or minimum threshold value has been exceeded or to check a trend in the variation of a metric. In embodiments of the present invention, alarms can be generated based on a certain dynamic behavior of a metric. This can be done by an alarm system that monitors the metrics and checks whether an alarm should be generated. Such an alarm can be a notification that the purchase amount of this consumer has risen above US $ 9000. It is an advantage of embodiments of the present invention that an alarm can be generated based on the metrics that can be calculated using the real-time aggregation primitives calculated. The alarm can be generated in a predefined time period. This time span can for example be limited to less than 5 seconds, more preferably less than 2 seconds, more preferably less than one second after an alarm condition has been triggered. The alarms can be defined per metric, per entity (e.g. consumer) or per collection of entities.
In embodiments of the present invention, the data processing system 100 can automatically produce the alarms during execution. This is done by an alarm system that analyzes the dynamic behavior of at least one of the metrics. The alarms can be generated based on the values of the metrics. Since these values can be calculated in real time, the alarm can be generated as soon as the alarm condition is activated. Generation of the alarms only requires checking of special circumstances of the metrics. Thus, the load on the data processing system 100 is not significantly increased by the alarm system. The alarm system may preferably be disconnected from the type of data storage element 130 and other technological choices made for the data processing system 100. For example, the alarm system may have access to a query service for access to the data storage element 130.
In embodiments of the present invention, an alarm can be defined as a Boolean predicate applied to a combination of metrics. An alarm message is created when a metric changes so that the result of the predicate exceeds the threshold from incorrect to correct or from correct to incorrect. This alarm message can be sent to one or more consumers who are registered for alarm messages.
In an exemplary embodiment of the present invention, the data processing system 100 is implemented to track credit card purchases for different consumers. Interaction types in this case are purchases and web clicks.
In this example, the following data can be extracted from the purchase interaction: the purchase amount, the merchant category, and the city where the transaction took place. The entities used for grouping the aggregation primitives are the merchant category and the city. The aggregation primitives in this case are the sum of the purchase amounts on one day, the number of interactions on one day and the maximum interaction amount on one day. The purchase amounts for a week can then be calculated by summing the values of each day for the specified week. The purchase amounts for a week is a metric that can be obtained using the aggregation primitive "the purchase amount for a day".
The web clicks are another interaction type. For example, an aggregation primitive that can be calculated for this interaction type is the number of web clicks on one day.
The interaction types together with the extractable data are summarized in the table below.
The corresponding aggregation primitives are: - the sum of the purchase amounts on one day (AMOUNT), - the number of interactions on one day (NUMBER), - the maximum interaction amount on one day (MAX.), - the number of web clicks.
The first three can be grouped by the entity trader category and the city entity. In this exemplary embodiment of the present invention, the aggregation primitives are calculated and stored per day. Other time granularities may also be possible.
After a first interaction, the following aggregation primitives are stored in the data storage element 130 by the data processing system 100.
Interaction 1 (date = 01-01-2015):
Aggregation primitives in the data storage element 130 grouped per merchant category and per city after interaction 1:
Interaction 2 (date = 01-01-2015):
Aggregation primitives in the data storage element 130 grouped per merchant category and per city after interaction 2:
Interaction 3 (date = 01-01-2015):
Aggregation primitives in the data storage element 130 grouped per merchant category and per city after interaction 3:
Interaction 4 (date = 01-01-2015):
Aggregation primitives in the data storage element 130 grouped per merchant category and per city after interaction 2:
These are the interactions that took place on 01-01-2015. The aggregation primitives are calculated in real time by the first processing unit 120 upon arrival of the interaction. It is an advantage of embodiments of the present invention that the real-time aggregation primitives calculated can be used to analyze the dynamic behavior of the consumer level metrics with regard to the purchase interactions.
Metrics can be updated on request based on the already calculated aggregation primitives. The metrics are defined in a first configuration object 110 and the aggregation primitives that can be used for their calculation are defined in a second configuration object 140. A metric can be, for example: "What is the maximum interaction amount per city in the past week". This maximum can be obtained from the calculated aggregation primitives stored in the data storage element 130 by collecting the MAX aggregation primitives (Ghent: 12, Ghent: 5, Brussels: 20) and by classification per city entity (Ghent: 12, Brussels: 20 ).
Another metric could be: "What is the average purchase amount of the past week". The purchase interactions are processed in real time by the data processing system 100 and stored in the NUMBER aggregation primitive. The purchase amount is stored in the AMO aggregation primitive. The metric can be obtained by summing the NUMBER aggregation primitives and the AMOUNT aggregation primitives (sum NUMBER: 4, sum AMOUNT: 47). The average metric can then be obtained by dividing the two together (AMOUNT / NUMBER = 47/4 = 11.75). This metric can be calculated on request. It is thereby an advantage of embodiments of the present invention that the metric can be calculated based on the aggregation primitives and that it is not necessary to collect all information from the previous interactions.
A typical alarm can be generated when the sum of the purchase amount of the last 7 days is below a predetermined threshold, for example 100 (sum AmountPurchasesLast7Days <100). This alarm could then be applied to the 'Big Besters' collection of consumers, and configured to trigger a notification in a dashboard when the determined purchase amount of a member of the 'Big Besters' collection of the last 7 days falls below 100 euros.
An alarm can therefore be split into the following predicates: - A Boolean predicate that defines the threshold correctly / incorrectly for the alarm itself. - A membership predicate that determines to which metrics the alarm should be applied. This can be an ad-hoc collection of individual consumers identified by an ID, a defined collection of consumers, or a collection itself. A collection is therefore one of a group of consumers who share the same statistical values. It is possible to define specific metrics, for example, the average of metrics across different consumers in the collection.
Alarm notifications are defined as an alarm action (e.g., sending a message to a notification queue, etc.) for one or more alarms.
The Boolean predicate can be applied to a subset of metrics. For example, a subset of metrics with values that are already above a predetermined level can be selected. A subset of metrics can also be set up based on an entity. This can be an ad hoc collection of individual consumers identified by an ID, a defined collection of consumers, or a collection itself. An alarm can therefore be defined as two predicates: the Boolean predicate and the membership predicate.
In embodiments of the present invention, 4 alarm types can be distinguished: - Monument alarms (other entities can also be considered): An alarm for a specific metric that is defined at the level of a single consumer, and is used to only measure the metrics for that consumer. to evaluate. - Collection alarms: an alarm for a particular metric that is defined at the level of a collection, and that is used to evaluate the metric of the collection (and not the consumers of that collection). - Metric alarms: an alarm for a specific metric that is valid for all consumers. - Alarms for members of a collection: an alarm for a specific metric that is valid for all members of a collection. These alarm types must be re-evaluated each time the membership of the collection is re-determined (this is not the case for metric alarms)
In embodiments of the present invention, the following requirements may apply individually or in combination: - An alarm is defined on a Boolean predicate that is calculated using at least one metric. - An alarm can have multiple statistical validations that can be combined using AND, OR and NOT. - The metric for which the alarm is defined can be evaluated by comparing it with a static value or another metric. - Alarms can be defined at different entity levels (companies and consumers can in this case be, for example, entities). - Multiple alarms can be defined for the same metric. - The status of an alarm can be reset to trigger alarm events whenever the alarm conditions are met.
In embodiments of the present invention, the core of the top-level alarm system is formed by three main functionalities. These are: - The definition of alarms - The calculation / creation / sending of alarm messages - The consumption of alarm messages.
In embodiments of the present invention, an alarm in a permanent storage medium is defined as two predicates: the Boolean predicate and the membership predicate as defined above.
Write interceptors, e.g., in the form of co-processors, can intercept all write actions from aggregation primitives to an aggregate structure, and evaluate which metrics will be affected by that write action. The metric ID, time stamp of the write action, and the list of potential metrics affected by the write action are added to a message queue (called the "metric change queue" 321). Note that the metric change queue has no knowledge of alarms - it only provides messages that one or more statistical values are likely to have changed. Where possible, a hint can also be provided about the change (eg increased or decreased).
An additional process 313, which can be executed on a regular (e.g. daily) basis, can also send messages to the metric change queue for each metric with a list of metrics that may have changed with the passage of time (and not by processing an interaction). In embodiments of the present invention, an offline statistical value expiration process 313 is responsible for expiring statistical values based on a time window. For example, a sum Amount Last 7 Days stays at a constant value for the 7 days after a single purchase is made, and then drops to 0 without processing any further interactions. The offline statistical process expiration process 313 lists all potentially changed metrics based on the current time (as well as their dependent formula metrics), creates a list of these metric names, and creates messages for each metric. This process can be optimized by only sending metric change messages for metrics that have interacted in the run-up time (eg if a metric has been completely inactive during the past 6 months, there is no possibility that the sum AmountLast7 Days of it will change). FIG. 3 is a flowchart illustrating message queues in an alarm system 300 according to an embodiment of the present invention. A consumer of the metric change queue 321 receives messages about metric changes. The metric change queue 321 may receive input from different processes. The input may be from an endpoint of the metric calculator 311, and / or from the write interceptor of an entity field 312 and / or the decay process of the statistical value 313. The endpoint of the metric calculator 311 thereby processes the incoming interaction data 301 and the write interceptor of an entity field 312 processes the incoming entity data 302. If a metric change message is received for a metric that has one or more configured alarms for it, it is checked for each applicable alarm whether its condition can be changed by one of the metrics that was changed. If the alarm condition for the metric is potentially changed due to the changed statistical values, the statistical values from just before and just after the time stamp are requested in the metric change message. The Boolean predicate can be calculated for the values before and after the change, and if the result of the predicate is different between the two timestamps, a positive (from incorrect to correct) or a negative (from correct to incorrect) message is added to the alarm message queue 333. One or more consumers of the alarm message queue 334 consume message messages from the alarm message queue 333. These consumers 334 may take some sort of action (saving the alarm message in an external system, sending a push message to an external system, etc.) . Combination of multiple alarm messages and comparable functionality can also take place here.
It is an advantage of embodiments of the present invention that operations for generation of metric change messages are minimal. The workload can be minimized by evaluating a metric only at a time resolution at which it changes and not at a smaller time resolution.
In embodiments of the present invention, the alarm manager 332 uses a query service to obtain the metrics. It is an advantage of embodiments of the present invention that the alarm manager 322 is thereby disconnected from the rest of the data processing system. In embodiments of the present invention, the metric change queue 321 can be used for matters other than alarms. For example, the metric change queue can be used to determine whether a consumer should be analyzed for changed membership of the collection.
In embodiments of the present invention, alarms are defined as records in an alarm table. An alarm has a name, a Boolean predicate, and a membership predicate (a list of consumer IDs or a list of collection IDs). These definitions can be stored in an alarm configuration object 331 that is used by the alarm manager 332.
The write interceptor may be inherent to the interaction recording endpoint, the interaction recording endpoint being the component that writes data from the first processing unit 120 to the data storage element 130. For each incoming interaction that is written, a collection can be made of the names of all aggregation primitives and their dependent formulas that may be affected by the change. This is essentially the reverse of the operation where it is decided which aggregation primitive must be read to calculate a metric. The interaction recording endpoint can act as a producer for the metric change queue, and write the relevant message (with metric ID, and change direction hints) to the metric change queue. In addition, it is an advantage that an update of a metric is only evaluated for specified metrics and consumers for which alarms must be generated. It is therefore an advantage that it is not necessary to evaluate every metric change for every consumer.
In embodiments of the present invention, the data processing system includes an alarm manager 332. The alarm manager 332 is a consumer of the metric change queue 321, and is responsible for calculating the values before and after the change of the statistical value to determine if a metric is an alarm threshold in exceeded one of the two directions. The alarm manager 332 is not necessarily a single process - it can be distributed, but the underlying message queue is responsible for ensuring that all messages for the same metric are always sent to the same instance of the alarm manager 332. In embodiments of the present invention, the data processing system can detect fine-grained recent changes to metrics. For example, the data processing system can request the exact value of the sum Amount Last 7 Days exactly 5 seconds ago, but also 4 seconds ago. The alarm manager 332 can request the metric before and after a metric change message. This can increase the load on the data processing system. Therefore, the alarm manager 332 can use various strategies to minimize the number of queries performed on the data processing system. The first strategy to be used to reduce the additional load on queries is grouped sending of multiple metric change messages in single queries to the data processing system. For example, think of 1000 incoming change messages per second. Instead of sending 1000 separate queries to the data processing system, they can be merged into a single batch request for the data processing system with a filter on the metric IDs. The filter on the metric IDs only filters those metric IDs that are related to an alarm. The batch requests are executed to prevent queries from being executed at a higher speed than the expected speed of the changes. It is an advantage of embodiments of the present invention that grouping of requests saves a considerable amount of resources in the data processing system, since each individual request (either for a single entity or batch) must evaluate how the query must be met and create read requests for the necessary data. When grouping metric requests from a number of around 1000, for example, the prior evaluation should only be performed once per 1000 entities. In addition, for the small chance that multiple metric change messages for the same metric arrive within the same second, this allows a large deduplication of effort.
In embodiments of the present invention, the alarm state of a metric can be cached. For a metric, for example, the alarm for sum Amount Last 7 Days may be activated. This fact can be cached in the alarm manager 332's memory. It is an advantage that further incoming metric change messages can then be ignored unless they contain the hint that the sum Amount Last 7 Days (potential) has decreased.
If multiple metrics are used for an alarm condition, further optimizations can be applied by caching the last read value that led to the Boolean predicate failing. Consider, for example, an alarm that is defined as 'sum AmountLast7Days> 100 and contractCount> 10'. When evaluating this predicate on a metric and in case both parts of the predicate fail, it is possible to wait for at least one metric change message for sum Amount Last 7 Days and at least one variable change message for contract Count to be received before applying another query to this metric . This method can be applied when the Boolean predicate can be split into multiple AND-ed components. In embodiments of the present invention, an additional optimization that is applied in the case of multiple AND-ed components forces the cached alarm condition by the data processing unit as an additional filter. By checking the IDs that are returned, it can already be determined whether the alarm condition has changed.
This optimization benefits from the fact that field-based filters are applied in the data processing system before aggregation primitives are calculated, and that filters are applied to aggregation primitives before formulas are calculated. In this way the calculation of statistical values can be shorted within the data processing system.
It is an advantage of embodiments of the present invention that alarm conditions can be split and that alarm conditions that are easily achieved have little impact on the data processing system.
In embodiments of the present invention, the metric change queue 321 and the alarm message queue 333 are separate items in a queue. The metric change queue 321 is subdivided, and metric change messages contain the metric ID as a key, causing all messages for a single metric ID to be sent to the same consumer. This facilitates the caching and optimizations discussed in the previous section. The alarm message queue 333 is an interface for integrating external systems with the data processing system 100. By filtering and processing the side effects of statistical values in alarm messages, there is much less traffic in this queue and therefore easier integration with external systems is possible. Typical standardized queue mechanisms are used for integration of the data processing system 100 with an external system. JMS is a typical example thereof.
In a second aspect, the present invention relates to a method 200 for processing data relating to incoming interactions. An example of this is illustrated in FIG. 2.
In a first configuration step 210, metrics are defined for use for querying the data related to incoming interactions. Different metrics will be defined depending on the type of information that a user wants to extract from the data regarding incoming interactions. A number of metrics can be defined within a single system.
In a second configuration step 220, aggregation primitives are defined, taking into account a relationship between the aggregation primitives and the defined metrics, and taking into account a relationship between data regarding incoming interactions and aggregation primitives. The aggregation primitives can thereby be configured to be calculated from the interaction data using simple, purely distributive operations (e.g., sum, max., Min.). The operators for obtaining the metrics can be more complex since these metrics are ultimately calculated on request and it is not necessary that they are calculated on a real-time basis upon arrival of incoming interactions.
A third step 230 comprises execution of a process wherein the data relating to incoming interactions is received and wherein at least one aggregation primitive is calculated in real time based on this data. The process therefore uses the relationship between the incoming interactions and the aggregation primitives defined in the second configuration step 220. The at least one aggregation primitive is stored in a data storage element without storage of raw data with respect to the incoming interactions. The process for calculation and storage of the aggregation primitives can be carried out continuously.
A fourth step 240 includes execution of a process wherein requests for the at least one metric are received and wherein the at least one metric is calculated based on the relationship between the at least one aggregation primitive and the at least one metric. The metrics can be calculated on request. Certain metrics can also be calculated on arrival of a new interaction.
In embodiments of the present invention, the process for calculating the aggregation primitives and the process for calculating the metrics are performed in parallel.

权利要求:
Claims (9)
[1]
Conclusions
1, - A data processing system for processing data relating to incoming interactions, the data processing system comprising: - a first configuration object defining a set of metrics, - a second configuration object defining a set of aggregation primitives, by analyzing requirements for the set of metrics, - a first processing unit adapted to receive data relating to incoming interactions, and adapted to calculate in real-time at least one aggregation primitive based on this data taking into account the second configuration object, -a data storage element for storing the calculated at least one aggregation primitive without storage of raw data with respect to the incoming interactions, - a second processing unit that is adapted to calculate one or more metrics based on the stored at least one aggregation primitive taking into account the first configuration object and the second configuration object.
[2]
A data processing system according to claim 1, wherein the first processing unit is arranged for retrieving data from an external storage medium and wherein the first processing unit is arranged for calculating at least one aggregation primitive based on the data requested from the external storage medium.
[3]
3. A data processing system according to any one of the preceding claims, wherein the first configuration object and the second configuration object are one and the same object.
[4]
A data processing system according to any one of the preceding claims, wherein the first processing unit and the second processing unit are one and the same processing unit.
[5]
A data processing system according to any one of the preceding claims, wherein the first processing unit is arranged for subdividing interactions into interaction types.
[6]
A data processing system according to any of the preceding claims, wherein the first processing unit is arranged for grouping aggregation primitives on the basis of the extracted data.
[7]
A data processing system according to any one of the preceding claims, wherein the first processing unit and / or the second processing unit is adapted to monitor at least one metric.
[8]
A data processing system according to claim 7, wherein the first processing unit and / or the second processing unit is configured to generate an alarm based on the monitored at least one metric.
[9]
9, - A method for processing data relating to incoming interactions, the method comprising: - defining a set of metrics in a first configuration object, - defining a number of aggregation primitives in a second configuration object by analyzing requirements for the set of metrics, - execution of a process in which the data relating to incoming interactions are received and in which at least one aggregation primitive is calculated in real time on the basis of this data taking into account the second configuration object, and wherein the at least one aggregation primitive is stored in a data storage element without storage of raw data with respect to the incoming interactions, and - execution of a process in which requests for at least one metric are received and wherein the at least one metric is calculated based on the stored at least one aggregation primitive with taking into account the first configuration object and the second configuration object.

类似技术:

公开号 | 公开日 | 专利标题

JP6707564B2|2020-06-10|Data quality analysis

CN107766568B|2021-11-26|Efficient query processing using histograms in columnar databases

US11100435B2|2021-08-24|Machine learning artificial intelligence system for predicting hours of operation

US10089362B2|2018-10-02|Systems and/or methods for investigating event streams in complex event processing | applications

TWI706422B|2020-10-01|Risk control method, device, server and storage medium

US20180240020A1|2018-08-23|Segmentation platform

US10410137B2|2019-09-10|Method and system for analyzing accesses to a data storage type and recommending a change of storage type

CN106649119B|2019-09-20|The test method and device of stream calculation engine

US20200192894A1|2020-06-18|System and method for using data incident based modeling and prediction

Kumar2015|An encyclopedic overview of ‘big data’analytics

US8725461B2|2014-05-13|Inferring effects of configuration on performance

CA2900287C|2019-07-16|Queue monitoring and visualization

CA3080840A1|2020-11-16|System and method for diachronic machine learning architecture

BE1023661A9|2017-07-06|A DATA PROCESSING SYSTEM FOR PROCESSING INTERACTIONS

Emmenegger et al.2012|Improving supply-chain-management based on semantically enriched risk descriptions.

US20150356574A1|2015-12-10|System and method for generating descriptive measures that assesses the financial health of a business

Gaur et al.2014|Assessing the understandability of a data warehouse logical model using a decision-tree approach

Singh2014|Big Data in Capital Markets

Peer et al.2013|Complex events processing: unburdening big data complexities

US11276015B2|2022-03-15|Machine learning artificial intelligence system for predicting hours of operation

CN110688433B|2020-04-21|Path-based feature generation method and device

US20200250231A1|2020-08-06|Systems and methods for dynamic ingestion and inflation of data

Khatiwada2012|Architectural issues in real-time business intelligence

Rajakumar et al.2013|Challenges and Opportunities of big data Analytics in Business Applications

Arampatzis2018|Clustering streaming data in distributed environments based on belief propagation techniques

同族专利:

公开号 | 公开日

US9830339B2|2017-11-28|

US20160328422A1|2016-11-10|

BE1023661A1|2017-06-09|

BE1023661B1|2017-06-09|

引用文献:

公开号 | 申请日 | 公开日 | 申请人 | 专利标题

US6567803B1|2000-05-31|2003-05-20|Ncr Corporation|Simultaneous computation of multiple moving aggregates in a relational database management system|

US6567804B1|2000-06-27|2003-05-20|Ncr Corporation|Shared computation of user-defined metrics in an on-line analytic processing system|

US8660869B2|2001-10-11|2014-02-25|Adobe Systems Incorporated|System, method, and computer program product for processing and visualization of information|

US20110213663A1|2010-02-26|2011-09-01|Carrier Iq, Inc.|Service intelligence module program product|

US8447721B2|2011-07-07|2013-05-21|Platfora, Inc.|Interest-driven business intelligence systems and methods of data analysis using interest-driven data pipelines|US11120176B2|2017-05-03|2021-09-14|Oracle International Corporation|Object count estimation by live object simulation|

法律状态:

优先权:

申请号 | 申请日 | 专利标题

US14/703,600|2015-05-04|

US14/703,600|US9830339B2|2015-05-04|2015-05-04|Data processing system for processing interactions|

[返回顶部]